R version 4.1.1 “Kick Things”

library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(GGally)
library(gridExtra)
library(plotly)

Load dataset into R
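
The loading step is not shown in this report, so here is a minimal sketch of what it might look like with readr. The file written below is a hypothetical two-row stand-in (the values are made up); only the column names are taken from the summary output in the next section.

```r
library(readr)

# Hypothetical stand-in for the real CSV: two made-up rows that use the raw
# column names seen in the summary output of this report.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  paste0("country,year,sex,age,suicides_no,population,suicides/100k pop,",
         "country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation"),
  "Albania,1987,male,15-24,21,312900,6.71,Albania1987,,2156624900,796,Generation X",
  "Austria,1990,female,75+,30,250000,12.00,Austria1990,0.79,166000000000,23000,G.I. Generation"
), csv_path)

# read_csv() guesses a type for every column; text columns come in as character.
suicides_raw <- read_csv(csv_path)

# summary() gives the per-column overview shown in the next section.
summary(suicides_raw)
```

With the real file, `csv_path` would simply point at the downloaded CSV.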

Explore Suicides dataset

This is how R types each column when the file is first loaded. Several of the character columns need to be converted to factors (categorical variables).

##    country               year          sex                age           
##  Length:27820       Min.   :1985   Length:27820       Length:27820      
##  Class :character   1st Qu.:1995   Class :character   Class :character  
##  Mode  :character   Median :2002   Mode  :character   Mode  :character  
##                     Mean   :2001                                        
##                     3rd Qu.:2008                                        
##                     Max.   :2016                                        
##                                                                         
##   suicides_no      population       suicides/100k pop country-year      
##  Min.   :    0   Min.   :     278   Min.   :  0.00    Length:27820      
##  1st Qu.:    3   1st Qu.:   97498   1st Qu.:  0.92    Class :character  
##  Median :   25   Median :  430150   Median :  5.99    Mode  :character  
##  Mean   :  243   Mean   : 1844794   Mean   : 12.82                      
##  3rd Qu.:  131   3rd Qu.: 1486143   3rd Qu.: 16.62                      
##  Max.   :22338   Max.   :43805214   Max.   :224.97                      
##                                                                         
##   HDI for year   gdp_for_year ($)   gdp_per_capita ($)  generation       
##  Min.   :0       Min.   :4.69e+07   Min.   :   251     Length:27820      
##  1st Qu.:1       1st Qu.:8.99e+09   1st Qu.:  3447     Class :character  
##  Median :1       Median :4.81e+10   Median :  9372     Mode  :character  
##  Mean   :1       Mean   :4.46e+11   Mean   : 16866                       
##  3rd Qu.:1       3rd Qu.:2.60e+11   3rd Qu.: 24874                       
##  Max.   :1       Max.   :1.81e+13   Max.   :126352                       
##  NA's   :19456

Update format with factor and numeric variables

  • 3 categorical variables
    • country
    • sex
    • age
  • 4 numeric
    • suicides
    • population
    • suicides per 100k
    • gdp per capita
  • 1 time-series
    • year
##         country           year          sex           age        suicides_no   
##  Austria    :  382   Min.   :1985   female:13910   5-14 :4610   Min.   :    0  
##  Iceland    :  382   1st Qu.:1995   male  :13910   15-24:4642   1st Qu.:    3  
##  Mauritius  :  382   Median :2002                  25-34:4642   Median :   25  
##  Netherlands:  382   Mean   :2001                  35-54:4642   Mean   :  243  
##  Argentina  :  372   3rd Qu.:2008                  55-74:4642   3rd Qu.:  131  
##  Belgium    :  372   Max.   :2016                  75+  :4642   Max.   :22338  
##  (Other)    :25548                                                             
##    population       suicides_p100k   gdp_per_capita  
##  Min.   :     278   Min.   :  0.00   Min.   :   251  
##  1st Qu.:   97498   1st Qu.:  0.92   1st Qu.:  3447  
##  Median :  430150   Median :  5.99   Median :  9372  
##  Mean   : 1844794   Mean   : 12.82   Mean   : 16866  
##  3rd Qu.: 1486143   3rd Qu.: 16.62   3rd Qu.: 24874  
##  Max.   :43805214   Max.   :224.97   Max.   :126352  
## 
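
The conversion summarized above could be sketched roughly as follows. This is an assumption about the approach, not the report's actual code: the toy rows are made up, and the explicit `age_levels` vector is inferred from the non-alphabetical order of the age groups in the summary.

```r
library(dplyr)

# Toy rows with the raw column names (hypothetical values, not the real data).
suicides_raw <- tibble(
  country = c("Albania", "Albania", "Austria"),
  year = c(1987, 1987, 1990),
  sex = c("male", "female", "male"),
  age = c("15-24", "75+", "5-14"),
  suicides_no = c(21, 1, 3),
  population = c(312900, 35600, 450000),
  `suicides/100k pop` = c(6.71, 2.81, 0.67),
  `HDI for year` = c(NA, NA, 0.79),
  `gdp_per_capita ($)` = c(796, 796, 23000)
)

# Explicit level order so age groups sort youngest to oldest, not alphabetically.
age_levels <- c("5-14", "15-24", "25-34", "35-54", "55-74", "75+")

suicides <- suicides_raw %>%
  # Keep the 8 columns of interest and rename the awkward ones in one step.
  select(country, year, sex, age, suicides_no, population,
         suicides_p100k = `suicides/100k pop`,
         gdp_per_capita = `gdp_per_capita ($)`) %>%
  # Convert the three categorical columns to factors.
  mutate(
    country = factor(country),
    sex = factor(sex),
    age = factor(age, levels = age_levels)
  )

summary(suicides)
```

Dropping `HDI for year` in the `select()` call is what removes the only column with missing values.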

Data Summary

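  • No NAs remain, so there are no missing values to handle (the only column with NAs in the raw data, HDI for year with 19,456, is not among the 8 retained columns)
  • Each of the 4 numeric variables has a heavy right skew (mean well above the median):
    • Histograms show this more clearly (see Variable Distributions)
    • Log transformations are needed to see the shape of each distribution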
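
Both checks in the list above can be done in a few lines of base R. The sketch below uses a toy data frame whose columns are just the five-number summaries printed earlier (min, quartiles, max), so the rows are not real observations; `numeric_cols` and `skew_check` are names chosen for illustration.

```r
# Toy data: each column holds the min, quartiles, and max printed in the
# summary output above. The rows are NOT real observations.
suicides <- data.frame(
  suicides_no    = c(0, 3, 25, 131, 22338),
  population     = c(278, 97498, 430150, 1486143, 43805214),
  suicides_p100k = c(0, 0.92, 5.99, 16.62, 224.97),
  gdp_per_capita = c(251, 3447, 9372, 24874, 126352)
)

# Count missing values per column.
na_counts <- colSums(is.na(suicides))

# Compare mean and median: a mean well above the median suggests right skew.
numeric_cols <- c("suicides_no", "population", "suicides_p100k", "gdp_per_capita")
skew_check <- data.frame(
  variable = numeric_cols,
  mean     = sapply(suicides[numeric_cols], mean),
  median   = sapply(suicides[numeric_cols], median)
)
skew_check$mean_gt_median <- skew_check$mean > skew_check$median

na_counts
skew_check
```

Mean versus median is only a quick heuristic; the histograms remain the better check.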

Head / Tail of Data

Nothing out of the ordinary here. The whole file appears to have been read in, with no stray header or footer rows that need to be skipped.

## # A tibble: 6 x 8
##   country  year sex   age   suicides_no population suicides_p100k gdp_per_capita
##   <fct>   <dbl> <fct> <fct>       <dbl>      <dbl>          <dbl>          <dbl>
## 1 Albania  1987 male  15-24          21     312900           6.71            796
## 2 Albania  1987 male  35-54          16     308000           5.19            796
## 3 Albania  1987 fema… 15-24          14     289700           4.83            796
## 4 Albania  1987 male  75+             1      21800           4.59            796
## 5 Albania  1987 male  25-34           9     274300           3.28            796
## 6 Albania  1987 fema… 75+             1      35600           2.81            796
## # A tibble: 6 x 8
##   country  year sex   age   suicides_no population suicides_p100k gdp_per_capita
##   <fct>   <dbl> <fct> <fct>       <dbl>      <dbl>          <dbl>          <dbl>
## 1 Uzbeki…  2014 fema… 25-34         162    2735238           5.92           2309
## 2 Uzbeki…  2014 fema… 35-54         107    3620833           2.96           2309
## 3 Uzbeki…  2014 fema… 75+             9     348465           2.58           2309
## 4 Uzbeki…  2014 male  5-14           60    2762158           2.17           2309
## 5 Uzbeki…  2014 fema… 5-14           44    2631600           1.67           2309
## 6 Uzbeki…  2014 fema… 55-74          21    1438935           1.46           2309

Further investigate high suicide counts

Sorting by suicide count in descending order shows that Russian men aged 35-54 had the highest raw counts, with all of the top ten rows falling between 1993 and 2003. These counts are not yet scaled by population, though. Does the same pattern appear in suicides per 100k persons?

## # A tibble: 1,467 x 8
##    country             year sex   age   suicides_no population suicides_p100k
##    <fct>              <dbl> <fct> <fct>       <dbl>      <dbl>          <dbl>
##  1 Russian Federation  1994 male  35-54       22338   19044200          117. 
##  2 Russian Federation  1995 male  35-54       21706   19249600          113. 
##  3 Russian Federation  2001 male  35-54       21262   21476420           99  
##  4 Russian Federation  2000 male  35-54       21063   21378098           98.5
##  5 Russian Federation  1999 male  35-54       20705   21016400           98.5
##  6 Russian Federation  1996 male  35-54       20562   19507100          105. 
##  7 Russian Federation  1993 male  35-54       20256   18908000          107. 
##  8 Russian Federation  2002 male  35-54       20119   21320535           94.4
##  9 Russian Federation  1997 male  35-54       18973   19913400           95.3
## 10 Russian Federation  2003 male  35-54       18681   21007346           88.9
## # … with 1,457 more rows, and 1 more variable: gdp_per_capita <dbl>
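
A sort like the one above might be written with dplyr as below. The table has 1,467 rows rather than 27,820, so some filter was evidently applied, but its cutoff is not shown; `min_count` here is a hypothetical threshold. The four rows are copied from tables in this report.

```r
library(dplyr)

# Four rows copied from the tables in this report.
suicides <- tibble(
  country     = c("Albania", "Russian Federation", "Aruba", "Russian Federation"),
  year        = c(1987, 1995, 1995, 1994),
  sex         = c("male", "male", "male", "male"),
  age         = c("15-24", "35-54", "75+", "35-54"),
  suicides_no = c(21, 21706, 2, 22338),
  population  = c(312900, 19249600, 889, 19044200)
)

min_count <- 1000  # hypothetical cutoff; the report does not state the one used

top_counts <- suicides %>%
  filter(suicides_no >= min_count) %>%  # keep only high-count rows
  arrange(desc(suicides_no))            # largest raw count first

top_counts
```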

Suicides per 100k persons

Russian men are no longer at the top of the list, so their high raw counts may have been driven largely by a large population. Both top-ten lists contain only men, aged 35-54 for raw counts and 75+ for rates, so sex and age may both play a role.

## # A tibble: 27,820 x 8
##    country            year sex   age   suicides_no population suicides_p100k
##    <fct>             <dbl> <fct> <fct>       <dbl>      <dbl>          <dbl>
##  1 Aruba              1995 male  75+             2        889           225.
##  2 Seychelles         2006 male  75+             2        976           205.
##  3 Suriname           2012 male  75+            10       5346           187.
##  4 Republic of Korea  2011 male  75+          1276     688365           185.
##  5 Republic of Korea  2010 male  75+          1152     631853           182.
##  6 Hungary            1992 male  75+           317     178482           178.
##  7 Hungary            1993 male  75+           300     168944           178.
##  8 Hungary            1991 male  75+           333     188235           177.
##  9 Republic of Korea  2005 male  75+           780     442349           176.
## 10 Hungary            1994 male  75+           292     165660           176.
## # … with 27,810 more rows, and 1 more variable: gdp_per_capita <dbl>
##     0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
##   0.00   0.00   0.41   1.60   3.54   5.99   9.09  13.56  20.53  33.29 224.97
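
To make the scaling explicit: the rate is the count divided by the population, times 100,000. The sketch below recomputes it for four rows copied from this report, re-sorts by rate, and computes deciles the way the quantile output above appears to have been produced (the `probs` argument is an assumption based on the 0% to 100% labels).

```r
library(dplyr)

# Counts and populations copied from rows shown in this report.
suicides <- tibble(
  country     = c("Russian Federation", "Albania", "Aruba", "Seychelles"),
  year        = c(1994, 1987, 1995, 2006),
  suicides_no = c(22338, 21, 2, 2),
  population  = c(19044200, 312900, 889, 976)
)

by_rate <- suicides %>%
  # Rate per 100k persons = count / population * 100,000.
  mutate(suicides_p100k = suicides_no / population * 1e5) %>%
  arrange(desc(suicides_p100k))

# Deciles of the rate, from the minimum (0%) to the maximum (100%).
rate_deciles <- quantile(by_rate$suicides_p100k, probs = seq(0, 1, by = 0.1))

by_rate
rate_deciles
```

Two suicides in a population of 889 is enough to put Aruba first, which shows how volatile rates are for very small groups.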

Suicides per 100k persons vs Age by Sex

Do sex and age contribute to historical suicide rates? It appears so: rates are higher among men than women and among the elderly than the young. Keep in mind that this plot pools about 30 years of data (1985-2016) from over 100 countries.
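
The plot itself did not survive in this copy of the report, and its exact type is unknown. One plausible way to draw rate against age group, split by sex, is a grouped boxplot. Everything below is an assumption: the data are synthetic random draws, not the real dataset, and are generated so that male rates run higher purely for illustration.

```r
library(ggplot2)

set.seed(1)
age_levels <- c("5-14", "15-24", "25-34", "35-54", "55-74", "75+")

# Synthetic data: 20 draws per age/sex combination (not real observations).
toy <- expand.grid(
  age = factor(age_levels, levels = age_levels),
  sex = c("female", "male"),
  obs = 1:20
)
toy$suicides_p100k <- rexp(nrow(toy), rate = ifelse(toy$sex == "male", 1 / 20, 1 / 5))

# One box per age group and sex; fill colour distinguishes the sexes.
p <- ggplot(toy, aes(x = age, y = suicides_p100k, fill = sex)) +
  geom_boxplot() +
  labs(x = "Age group", y = "Suicides per 100k persons", fill = "Sex")

p
```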

Variable Distributions

I first tried a regular histogram for the variable suicides_no, but the right skew is so pronounced that putting the x axis on a log scale made sense. I expect population, suicides per 100k, and gdp per capita to be just as hard to read without a transformation.

I created a log histogram and density plot function to help in making these 4 plots. The get() function lets ggplot look up a column from its name passed in as a string. If I were using Python, I would have created a dictionary and looped over it to get the column and axis labels, but this felt harder to do in R.
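
For comparison, here is one way such a helper could look. It is a sketch, not the function used in this report: it swaps `get()` for ggplot2's `.data[[col]]` pronoun, and it uses a named character vector as a rough stand-in for a Python dictionary. The function name `plot_log_hist`, the labels, and the synthetic log-normal data are all assumptions.

```r
library(ggplot2)

# Draw a histogram with a density overlay on a log10 x axis.
# `col` is a column name given as a string; `label` is the axis label.
# Note: zeros become -Inf on a log scale and are dropped with a warning.
plot_log_hist <- function(data, col, label) {
  ggplot(data, aes(x = .data[[col]])) +
    geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "grey70") +
    geom_density() +
    scale_x_log10() +
    labs(x = paste(label, "(log10 scale)"), y = "Density")
}

# Synthetic right-skewed data (not the real dataset).
set.seed(42)
toy <- data.frame(
  suicides_no    = rlnorm(200, meanlog = 3, sdlog = 2),
  population     = rlnorm(200, meanlog = 13, sdlog = 1.5),
  suicides_p100k = rlnorm(200, meanlog = 1.5, sdlog = 1),
  gdp_per_capita = rlnorm(200, meanlog = 9, sdlog = 1)
)

# Named vector: names are column names, values are axis labels.
plot_labels <- c(
  suicides_no    = "Suicides",
  population     = "Population",
  suicides_p100k = "Suicides per 100k",
  gdp_per_capita = "GDP per capita ($)"
)

# One plot per entry, returned as a list named by column.
plots <- Map(function(col, label) plot_log_hist(toy, col, label),
             names(plot_labels), plot_labels)

plots$population
```

The resulting list could then be passed to something like `gridExtra::grid.arrange(grobs = plots, ncol = 2)` to show all four panels together.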